Large-Scale Similarity Join with Edit-Distance Constraints

نویسندگان

  • Chen Lin
  • Haiyang Yu
  • Wei Weng
  • Xianmang He
چکیده

In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted lots of attentions from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small scale data. Then we extend the algorithm based on MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance c...

متن کامل

Approximate String Similarity Join using Hashing Techniques under Edit Distance Constraints

The string similarity join, which is employed to find similar string pairs from string sets, has received extensive attention in database and information retrieval fields. To this problem, the filter-and-refine framework is usually adopted by the existing research work firstly, and then various filtering methods have been proposed. Recently, tree based index techniques with the edit distance co...

متن کامل

Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantag...

متن کامل

Scaling Similarity Joins over Tree-Structured Data

Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. The state-ofthe-art similarity join methods compare simpler approximations of the objects (e.g., strings), in order to prune pairs that cannot be part of the similarity join res...

متن کامل

PASS-JOIN: A Partition-based Method for Similarity Joins

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long stri...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014